Overview

Dataset statistics

Number of variables: 12
Number of observations: 135397
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 12.4 MiB
Average record size in memory: 96.0 B

Variable types

NUM: 10
CAT: 2

Reproduction

Analysis started: 2020-04-06 12:49:09.061681
Analysis finished: 2020-04-06 12:59:20.524814
Version: pandas-profiling v2.5.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

odometer is highly skewed (γ1 = 38.90228426) [Skewed]
condition has 64710 (47.8%) zeros [Zeros]
fuel has 8902 (6.6%) zeros [Zeros]
odometer has 3219 (2.4%) zeros [Zeros]
type has 33237 (24.5%) zeros [Zeros]
paint_color has 25229 (18.6%) zeros [Zeros]
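
As a rough illustration of how flags like those listed above can be derived, the sketch below recomputes the zero percentages from the reported counts and defines a population skewness (γ1) function. The 20.0 skewness cutoff and all helper names are assumptions made for this example, not settings read from config.yaml, and the profiler may use a slightly different (bias-corrected) skewness estimator.

```python
N_OBSERVATIONS = 135397  # "Number of observations" from the overview

# Zero counts copied from the report.
zero_counts = {
    "condition": 64710,
    "fuel": 8902,
    "odometer": 3219,
    "type": 33237,
    "paint_color": 25229,
}


def zero_percent(count, total=N_OBSERVATIONS):
    """Share of zero-valued cells in percent, rounded to one decimal."""
    return round(100 * count / total, 1)


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


SKEW_CUTOFF = 20.0  # assumed threshold, for illustration only


def is_highly_skewed(values, cutoff=SKEW_CUTOFF):
    """Flag a sample whose absolute skewness exceeds the (assumed) cutoff."""
    return abs(skewness(values)) > cutoff


zero_percents = {name: zero_percent(c) for name, c in zero_counts.items()}
print(zero_percents)
```

The recomputed percentages match the ones printed in the report, which suggests they are simply zero count divided by the number of observations.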

Variables

df_index
Real number (ℝ≥0)

UNIQUE
Distinct count: 135397
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 270462.1444
Minimum: 13
Maximum: 539752
Zeros: 0
Zeros (%): 0.0%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 13
5-th percentile: 28393.6
Q1: 136298
Median: 268962
Q3: 405372
95-th percentile: 513016.2
Maximum: 539752
Range: 539739
Interquartile range (IQR): 269074

Descriptive statistics

Standard deviation: 155804.7609
Coefficient of variation (CV): 0.5760686444
Kurtosis: -1.200028544
Mean: 270462.1444
Median Absolute Deviation (MAD): 134906.0336
Skewness: 0.0101182272
Sum: 3.661976296e+10
Variance: 2.427512351e+10
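
Several of the figures above are tied together by simple arithmetic. The sketch below uses only numbers copied from this df_index summary to recompute the range, the IQR, the coefficient of variation, and the variance. The variable names are illustrative.

```python
import math

# Values copied from the df_index summary.
minimum, maximum = 13, 539752
q1, q3 = 136298, 405372
mean = 270462.1444
std = 155804.7609

value_range = maximum - minimum  # Range = Maximum - Minimum
iqr = q3 - q1                    # IQR = Q3 - Q1
cv = std / mean                  # CV = standard deviation / mean
variance = std ** 2              # Variance = standard deviation squared

print(value_range, iqr, round(cv, 6), f"{variance:.6e}")
```

The same relationships can be used to sanity-check the summary of any other numeric variable in the report.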
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[1.300000e+01 9.545000e+02 3.701500e+03 4.028500e+03 6.405500e+03 ... 5.341785e+05 5.368050e+05 5.383260e+05 5.391885e+05 5.397520e+05], "bayesian blocks" binning strategy used)

Common values

Value                  Count   Frequency (%)
264191                 1       < 0.1%
116242                 1       < 0.1%
161252                 1       < 0.1%
400873                 1       < 0.1%
407020                 1       < 0.1%
409069                 1       < 0.1%
495966                 1       < 0.1%
443890                 1       < 0.1%
358830                 1       < 0.1%
189942                 1       < 0.1%
Other values (135387)  135387  > 99.9%

Minimum 5 values

Value  Count  Frequency (%)
13     1      < 0.1%
25     1      < 0.1%
28     1      < 0.1%
29     1      < 0.1%
31     1      < 0.1%

Maximum 5 values

Value   Count  Frequency (%)
539752  1      < 0.1%
539744  1      < 0.1%
539735  1      < 0.1%
539732  1      < 0.1%
539724  1      < 0.1%

price
Real number (ℝ≥0)

Distinct count: 5183
Unique (%): 3.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 11767.09613
Minimum: 1001
Maximum: 39999
Zeros: 0
Zeros (%): 0.0%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 1001
5-th percentile: 2500
Q1: 5250
Median: 8999
Q3: 16400
95-th percentile: 28990
Maximum: 39999
Range: 38998
Interquartile range (IQR): 11150

Descriptive statistics

Standard deviation: 8368.352679
Coefficient of variation (CV): 0.7111654893
Kurtosis: 0.5840741446
Mean: 11767.09613
Median Absolute Deviation (MAD): 6732.04321
Skewness: 1.096526073
Sum: 1593229515
Variance: 70029326.55
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1001. 1099.5 1105.5 1192.5 1199.5 ... 39990.5 39993. 39996. 39998.5 39999. ], "bayesian blocks" binning strategy used)

Common values

Value                Count   Frequency (%)
3500                 1755    1.3%
6995                 1692    1.2%
4995                 1664    1.2%
4500                 1652    1.2%
5995                 1590    1.2%
5500                 1589    1.2%
7995                 1587    1.2%
6500                 1476    1.1%
3995                 1467    1.1%
8995                 1466    1.1%
Other values (5173)  119459  88.2%

Minimum 5 values

Value  Count  Frequency (%)
1001   1      < 0.1%
1005   1      < 0.1%
1024   1      < 0.1%
1028   1      < 0.1%
1050   5      < 0.1%

Maximum 5 values

Value  Count  Frequency (%)
39999  34     < 0.1%
39998  14     < 0.1%
39997  4      < 0.1%
39995  72     0.1%
39991  2      < 0.1%

year
Real number (ℝ≥0)

Distinct count: 96
Unique (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2009.046618
Minimum: 1908
Maximum: 2021
Zeros: 0
Zeros (%): 0.0%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 1908
5-th percentile: 1998
Q1: 2006
Median: 2010
Q3: 2014
95-th percentile: 2017
Maximum: 2021
Range: 113
Interquartile range (IQR): 8

Descriptive statistics

Standard deviation: 7.501392696
Coefficient of variation (CV): 0.003733807183
Kurtosis: 18.13918395
Mean: 2009.046618
Median Absolute Deviation (MAD): 5.131765936
Skewness: -2.963946261
Sum: 272018885
Variance: 56.27089238
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[1908. 1928.5 1941.5 1946.5 1961.5 ... 2017.5 2018.5 2019.5 2020.5 2021. ], "bayesian blocks" binning strategy used)

Common values

Value              Count  Frequency (%)
2013               10079  7.4%
2012               9689   7.2%
2011               9171   6.8%
2008               9104   6.7%
2014               9082   6.7%
2015               8801   6.5%
2007               8094   6.0%
2010               7644   5.6%
2016               7567   5.6%
2006               7105   5.2%
Other values (86)  49061  36.2%

Minimum 5 values

Value  Count  Frequency (%)
1908   3      < 0.1%
1923   6      < 0.1%
1925   1      < 0.1%
1926   1      < 0.1%
1927   2      < 0.1%

Maximum 5 values

Value  Count  Frequency (%)
2021   1      < 0.1%
2020   132    0.1%
2019   2129   1.6%
2018   3509   2.6%
2017   6247   4.6%

manufacturer
Real number (ℝ≥0)

Distinct count: 41
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 19.20192471
Minimum: 0
Maximum: 42
Zeros: 1163
Zeros (%): 0.9%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 0
5-th percentile: 5
Q1: 10
Median: 14
Q3: 32
95-th percentile: 40
Maximum: 42
Range: 42
Interquartile range (IQR): 22

Descriptive statistics

Standard deviation: 11.93364378
Coefficient of variation (CV): 0.6214816464
Kurtosis: -1.039684824
Mean: 19.20192471
Median Absolute Deviation (MAD): 10.31938783
Skewness: 0.5597150933
Sum: 2599883
Variance: 142.411854
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 0. 0.5 1.5 2.5 3.5 ... 38.5 39.5 40.5 41.5 42. ], "bayesian blocks" binning strategy used)

Common values

Value              Count  Frequency (%)
13                 26261  19.4%
7                  21166  15.6%
40                 11189  8.3%
32                 7698   5.7%
17                 7481   5.5%
21                 6254   4.6%
35                 5989   4.4%
14                 5986   4.4%
10                 4904   3.6%
4                  3606   2.7%
Other values (31)  34863  25.7%

Minimum 5 values

Value  Count  Frequency (%)
0      1163   0.9%
1      30     < 0.1%
2      1      < 0.1%
3      1468   1.1%
4      3606   2.7%

Maximum 5 values

Value  Count  Frequency (%)
42     923    0.7%
41     3060   2.3%
40     11189  8.3%
39     4      < 0.1%
38     3402   2.5%

condition
Real number (ℝ≥0)

ZEROS
Distinct count: 6
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.138082823
Minimum: 0
Maximum: 5
Zeros: 64710
Zeros (%): 47.8%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 1
Q3: 2
95-th percentile: 3
Maximum: 5
Range: 5
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 1.152550691
Coefficient of variation (CV): 1.012712491
Kurtosis: -1.388710329
Mean: 1.138082823
Median Absolute Deviation (MAD): 1.094926788
Skewness: 0.2596369366
Sum: 154093
Variance: 1.328373095
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.5 1.5 2.5 3.5 5. ], "bayesian blocks" binning strategy used)

Common values

Value  Count  Frequency (%)
0      64710  47.8%
2      51790  38.3%
3      14857  11.0%
1      3473   2.6%
4      366    0.3%
5      201    0.1%

Minimum 5 values

Value  Count  Frequency (%)
0      64710  47.8%
1      3473   2.6%
2      51790  38.3%
3      14857  11.0%
4      366    0.3%

Maximum 5 values

Value  Count  Frequency (%)
5      201    0.1%
4      366    0.3%
3      14857  11.0%
2      51790  38.3%
1      3473   2.6%

cylinders
Real number (ℝ≥0)

Distinct count: 8
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.663810867
Minimum: 0
Maximum: 7
Zeros: 916
Zeros (%): 0.7%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 0
5-th percentile: 3
Q1: 3
Median: 5
Q3: 6
95-th percentile: 6
Maximum: 7
Range: 7
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 1.265828633
Coefficient of variation (CV): 0.2714150872
Kurtosis: -0.3984356884
Mean: 4.663810867
Median Absolute Deviation (MAD): 1.085501421
Skewness: -0.6598825671
Sum: 631466
Variance: 1.602322128
Histogram with fixed size bins (bins=10)

Common values

Value  Count  Frequency (%)
5      50032  37.0%
6      41993  31.0%
3      40725  30.1%
4      1286   0.9%
0      916    0.7%
7      238    0.2%
2      156    0.1%
1      51     < 0.1%

Minimum 5 values

Value  Count  Frequency (%)
0      916    0.7%
1      51     < 0.1%
2      156    0.1%
3      40725  30.1%
4      1286   0.9%

Maximum 5 values

Value  Count  Frequency (%)
7      238    0.2%
6      41993  31.0%
5      50032  37.0%
4      1286   0.9%
3      40725  30.1%
 

fuel
Real number (ℝ≥0)

ZEROS
Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.891038945
Minimum: 0
Maximum: 4
Zeros: 8902
Zeros (%): 6.6%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 2
Median: 2
Q3: 2
95-th percentile: 2
Maximum: 4
Range: 4
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.5383426671
Coefficient of variation (CV): 0.2846808991
Kurtosis: 8.904212474
Mean: 1.891038945
Median Absolute Deviation (MAD): 0.2499516312
Skewness: -2.328496096
Sum: 256041
Variance: 0.2898128272
Histogram with fixed size bins (bins=10)

Common values

Value  Count   Frequency (%)
2      124244  91.8%
0      8902    6.6%
3      1157    0.9%
4      996     0.7%
1      98      0.1%

Minimum 5 values

Value  Count   Frequency (%)
0      8902    6.6%
1      98      0.1%
2      124244  91.8%
3      1157    0.9%
4      996     0.7%

Maximum 5 values

Value  Count   Frequency (%)
4      996     0.7%
3      1157    0.9%
2      124244  91.8%
1      98      0.1%
0      8902    6.6%
 

odometer
Real number (ℝ≥0)

SKEWED
ZEROS
Distinct count: 303
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 22.7956971
Minimum: 0
Maximum: 2000
Zeros: 3219
Zeros (%): 2.4%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 0
5-th percentile: 3
Q1: 13
Median: 22
Q3: 30
95-th percentile: 43
Maximum: 2000
Range: 2000
Interquartile range (IQR): 17

Descriptive statistics

Standard deviation: 26.2441299
Coefficient of variation (CV): 1.151275602
Kurtosis: 2474.675635
Mean: 22.7956971
Median Absolute Deviation (MAD): 10.55983543
Skewness: 38.90228426
Sum: 3086469
Variance: 688.7543544
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0.0000e+00 5.0000e-01 1.5000e+00 2.5000e+00 3.5000e+00 ... 4.0050e+02 5.1700e+02 7.5450e+02 1.9995e+03 2.0000e+03], "bayesian blocks" binning strategy used)

Common values

Value               Count  Frequency (%)
19                  4675   3.5%
20                  4642   3.4%
24                  4545   3.4%
22                  4396   3.2%
18                  4382   3.2%
23                  4251   3.1%
25                  4165   3.1%
21                  4148   3.1%
26                  4136   3.1%
16                  4000   3.0%
Other values (293)  92057  68.0%

Minimum 5 values

Value  Count  Frequency (%)
0      3219   2.4%
1      946    0.7%
2      1606   1.2%
3      1837   1.4%
4      2117   1.6%

Maximum 5 values

Value  Count  Frequency (%)
2000   2      < 0.1%
1999   5      < 0.1%
1737   1      < 0.1%
1668   2      < 0.1%
1560   1      < 0.1%

transmission
Categorical

Distinct count: 3
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 1.0 MiB

Value  Count   Frequency (%)
0      120729  89.2%
1      9317    6.9%
2      5351    4.0%

Length

Max length: 1
Mean length: 1
Min length: 1

Unicode categories

Value           Count  Frequency (%)
Decimal_Number  3      100.0%

Unicode scripts

Value   Count  Frequency (%)
Common  3      100.0%

Unicode blocks

Value  Count  Frequency (%)
ASCII  3      100.0%

drive
Categorical

Distinct count: 3
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 1.0 MiB

Value  Count  Frequency (%)
0      60137  44.4%
1      45567  33.7%
2      29693  21.9%

Length

Max length: 1
Mean length: 1
Min length: 1

Unicode categories

Value           Count  Frequency (%)
Decimal_Number  3      100.0%

Unicode scripts

Value   Count  Frequency (%)
Common  3      100.0%

Unicode blocks

Value  Count  Frequency (%)
ASCII  3      100.0%

type
Real number (ℝ≥0)

ZEROS
Distinct count: 13
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 6.144242487
Minimum: 0
Maximum: 12
Zeros: 33237
Zeros (%): 24.5%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 2
Median: 8
Q3: 9
95-th percentile: 11
Maximum: 12
Range: 12
Interquartile range (IQR): 7

Descriptive statistics

Standard deviation: 4.143149038
Coefficient of variation (CV): 0.6743140504
Kurtosis: -1.400031046
Mean: 6.144242487
Median Absolute Deviation (MAD): 3.811965893
Skewness: -0.4984001379
Sum: 831912
Variance: 17.16568395
Histogram with fixed size bins (bins=10)

Common values

Value             Count  Frequency (%)
9                 34110  25.2%
0                 33237  24.5%
10                21732  16.1%
8                 15920  11.8%
3                 8144   6.0%
4                 4752   3.5%
11                4658   3.4%
12                3946   2.9%
2                 3300   2.4%
5                 3147   2.3%
Other values (3)  2451   1.8%

Minimum 5 values

Value  Count  Frequency (%)
0      33237  24.5%
1      138    0.1%
2      3300   2.4%
3      8144   6.0%
4      4752   3.5%

Maximum 5 values

Value  Count  Frequency (%)
12     3946   2.9%
11     4658   3.4%
10     21732  16.1%
9      34110  25.2%
8      15920  11.8%

paint_color
Real number (ℝ≥0)

ZEROS
Distinct count: 12
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 5.635331654
Minimum: 0
Maximum: 11
Zeros: 25229
Zeros (%): 18.6%
Memory size: 1.0 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 1
Median: 7
Q3: 9
95-th percentile: 10
Maximum: 11
Range: 11
Interquartile range (IQR): 8

Descriptive statistics

Standard deviation: 3.985860394
Coefficient of variation (CV): 0.7072982813
Kurtosis: -1.582309131
Mean: 5.635331654
Median Absolute Deviation (MAD): 3.665926811
Skewness: -0.2840279952
Sum: 763007
Variance: 15.88708308
Histogram with fixed size bins (bins=10)

Common values

Value             Count  Frequency (%)
10                32045  23.7%
0                 25229  18.6%
9                 20417  15.1%
5                 15707  11.6%
1                 14416  10.6%
8                 14219  10.5%
4                 4226   3.1%
2                 3841   2.8%
3                 3160   2.3%
11                967    0.7%
Other values (2)  1170   0.9%

Minimum 5 values

Value  Count  Frequency (%)
0      25229  18.6%
1      14416  10.6%
2      3841   2.8%
3      3160   2.3%
4      4226   3.1%

Maximum 5 values

Value  Count  Frequency (%)
11     967    0.7%
10     32045  23.7%
9      20417  15.1%
8      14219  10.5%
7      378    0.3%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) measures the linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, which implies that for a linear relationship the angle of the line to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
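
The calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the profiler's own implementation; the function name `pearson_r` and the sample data are made up for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


# Hypothetical sample data, roughly linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)
# Shifting and rescaling either variable (with a positive factor) leaves r unchanged.
r_rescaled = pearson_r([100 + 3 * x for x in xs], [0.5 * y - 7 for y in ys])
print(r, r_rescaled)
```

Both printed values agree up to floating-point error, which is the location and scale invariance mentioned in the text.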

Spearman's ρ

Spearman's rank correlation coefficient (ρ) measures the monotonic correlation between two variables and is therefore better than Pearson's r at catching nonlinear monotonic relationships. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
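
A minimal sketch of that recipe: rank both variables (tied values share their average rank, which is an assumption of this example), then apply the Pearson formula to the ranks. The helper names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


def spearman_rho(xs, ys):
    """Pearson correlation of the rank variables."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# A monotonic but nonlinear relationship: y = x ** 3.
xs = [1, 2, 3, 4, 5]
cubes = [x ** 3 for x in xs]
print(spearman_rho(xs, cubes), pearson_r(xs, cubes))
```

For this cubic relationship the ranks of both variables coincide, so ρ reaches 1 while Pearson's r stays below 1.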

Kendall's τ

Like Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures the ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
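
The pair-counting definition above corresponds to the variant usually called tau-a. The sketch below implements exactly that definition. Library implementations often default to tau-b, which additionally corrects for ties, so results can differ on data with many repeated values. The function name and sample data are illustrative.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Pairs tied in x or y count as neither concordant nor discordant.
    n = len(xs)
    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: one swapped pair out of six.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
tau = kendall_tau_a(xs, ys)
print(tau)
```

Here five of the six pairs are concordant and one is discordant, giving τ = (5 - 1) / 6.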

Missing values

Sample

First rows

df_index price year manufacturer condition cylinders fuel odometer transmission drive type paint_color
01379952010706238001010
12540001995100622600105
2281600020114052170195
329109502011505280198
431940020114252290001
5354500201213052310099
6401495200418252390008
7422800200232352380009
8482590020081306214021010
9624999200719252220293

Last rows

df_index price year manufacturer condition cylinders fuel odometer transmission drive type paint_color
13538753970127755201113060310088
13538853970222457200835250440008
13538953971113500201432032130005
13539053971227002002175323601910
1353915397143950200918032250191
1353925397248995200724052370090
13539353973294572008252623800010
1353945397357455201340232270190
1353955397446300201432232170195
1353965397525295200630323000123